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We  consider  the  application  of  domain  decomposition  techniques  to  the  solution  of  sparse  linear 
systems  arising  from  implicit  PDE  discretizations  on  parallel  computers.  Representatives  of  two 
popular  MIMD  architectures,  message  passing  (the  Intel  iPSC/2-SX)  and  shared  memory  (the 
Encore  Multimax  320),  are  employed.  We  run  the  same  numerical  experiments  on  each,  namely 
stripwise  and  boxwise  decompositions  of  the  unit  square,  using  up  to  64  subdomains  and  containing 
up  to  64K  degrees  of  freedom.  We  produce  a  tight-fitting  complexity  model  for  the  former  and 
discuss  the  difficulty  of  doing  so  for  the  latter.  We  also  evaluate  which  of  three  types  of  domain 
decomposition  preconditioners  that  have  appeared  in  the  literature  of  self-adjoint  elliptic  problems 
are  most  efficient  in  different  regions  of  machine-problem  parameter  space.  Some  form  of  global 
sharing  of  information  in  the  preconditioner  is  required  for  efficient  overall  parallel  implementation 
in  the  region  of  most  practical  interest  (large  problem  sizes  and  large  numbers  of  processors); 
otherwise,  an  increasing  iteration  count  inveighs  against  the  gains  of  concurrency.  Our  results 
on  a  per  iteration  basis  also  hold  for  sparse  discrete  systems  arising  from  other  types  of  oartial 
differential  equations,  but  in  the  absence  of  a  theory  for  the  dependence  of  the  convergence  rate 
upon  the  granularity  of  the  decomposition,  the  overall  results  are  only  suggestive  for  more  general 
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1.  Introduction 


Domain  decomposition  techniques  appear  to  be  a  natural  way  to  distribute  the  solution  of 
i  large  sparse  linear  systems  across  many  parallel  processors.  In  this  paper  we  develop  complexity 

estimates  for  two  types  of  decompositions  and  two  parameterized  types  of  “real”  parallel  computer 
architectures,  and  validate  those  estimates  on  representative  machines,  with  particular  emphasis 
»  on  the  case  of  large  numbers  of  processors  and  large  problems.  We  examine  the  tradeoffs  between 

various  forms  of  preconditioning,  as  characterized  by  the  efficiency  of  their  parallel  implementation. 

Parallel  computers  may  be  divided  into  two  broad  classes:  distributed  memory  and  shared 
memory.  In  a  distributed  memory  parallel  processor,  each  processor  h  uo  its  own  memory  nnd 
no  direct  access  to  memory  on  any  other  processor.  Such  machines  are  usually  termed  “message 
passing”  computers  since  interprocessor  communication  is  accomplished  through  the  sending  and 
receiving  of  messages.  In  a  shared  memory  parallel  processor,  each  processor  has  direct,  ran¬ 
dom  access  to  the  same  memory  space  as  every  other  processor.  Interprocessor  communication  is 
conducted  directly  in  the  shared  memory.  In  practice,  of  course,  most  shared  memory  machines 
have  local  memory,  called  the  cache,  and  communication  is  through  messages,  called  cache  faults. 
However,  each  type  of  parallel  processor  is  optimized  for  a  different  interprocessor  communication 
pattern,  and  we  consider  the  effects  of  these  optimizations  on  domain  decomposition. 

Domain  decomposition  refers  in  a  generic  vay  to  the  replacement  of  a  partial  differential 
equation  problem  defined  over  a  global  domain  with  a  series  of  problems  over  subdomains  which 
collectively  cover  the  original.  Early  domain  decomposition  techniques,  whether  iterative  [12]  or 
,  direct  [10],  were  baser  on  exact  reductions  of  the  global  problem  to  a  set  of  lower-dimensional 

problems  on  interfaces  between  subdomains  by  means  of  direct  elimination  of  the  degrees  of  freedom 
interior  to  the  subdomains.  In  this  sense,  domain  decomposition  is  analogous  to  the  finite  element 
^  procedure  of  static  condensation.  A  more  modern  viewpoint  has  emerged  [2,  13]  in  which  the 

interior  unknowns  are  not  directly  eliminated,  but  retained  in  the  outer  iteration,  preconditioned 
by  an  appropriate  fast  approximate  solver.  In  [8],  we  referred  to  such  approaches  as  “partitioned 
matrix  methods”,  and  it  is  such  that  we  consider  here,  in  combination  with  preconditioned  conjugate 
gradients  as  the  outer  iterative  method.  Partitioned  matrix  iteration  was  shown  in  [8]  to  be  identical 
to  iteration  on  the  reduced  system  when  the  subdomain  solves  are  exact.  However,  it  has  the 
advantage  of  being  much  more  flexible  for  problems  for  which  the  only  known  exact  solvers  are 
expensive.  The  important  question  as  to  how  much  approximate  subdomain  solves  “penalize 
overall  iterative  convergence  has  begun  to  be  addressed  (see,  in  addition  to  the  references  above. 
[1]  and  section  6  of  [4]). 

An  example  of  the  simplest  form  of  domain  decomposition  is  shown  in  Figure  la.  A  single 
interface  (3)  divides  a  domain  into  subdomains  1  and  2.  The  matrix  obtained  upon  finite  differ¬ 
ence  or  finite  element  discretization  is  partitioned  according  to  the  physical  decomposition  into 
subdomains  connected  by  a  small  (lower  dimension)  interface  region  and  looks  like 
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where  An  and  A22  come  from  the  interior  of  the  subdomains.  A 33  from  along  the  interface,  and 
A 13  and  A23  from  the  interactions  between  the  subdomains  and  the  interfaces.  A  detailed  view  of 
this  matrix  for  the  5-point  operator  is  shown  in  Figure  2. 

T.'.e  geiw.Jization  of  Figure  i<t  to  any  number  of  strips  is  straightforward.  For  reasons  that 
will  become  clear,  it  is  important  to  also  consider  a  decomposition  into  boxes  as  shown  in  Figure  11). 
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The  matrix  for  this  decomposition  looks  like 


« 


/  Ai  Aib  0  \ 
.4C  =  AjB  Ab  Abc  1  , 
\  0  abc  ac  j 


where 


A[  =  Block  diagonal  subdomain  matrices 
Aib  —  Coupling  to  interiors  from  subdomain  interfaces 
Ab  =  Block  diagonal  subdomain  interfaces 
Abc  —  Coupling  to  interfaces  from  cress  points 
Ac  =  Cross  point  system 


This  representation  is  based  on  a  five-point  difference  template,  so  that  the  coupling  matrix  A/c 
is  zero.  More  generally,  nonzero  coupling  between  the  crosspoints  and  the  corner  points  of  the 
subdomain  interiors  would  need  to  be  taken  into  account.  A  detailed  view  of  this  matrix  for  the 
5-point  operator  is  shown  in  Figure  3,  for  the  case  of  a  decomposition  into  two  by  two  blocks.  We 
are  interested  in  various  preconditioners  for  both  As  and  ,4C,  based  on  their  efficacy  and  on  their 
parallel  limitations. 

The  choice  of  preconditioner  is  critical  in  domain  decomposition,  as  with  any  iterative  method. 
In  the  context  of  parallel  computing,  the  main  distinction  is  between  preconditioners  which  are 
purely  local,  those  ,iic!,  involve  neighbor  communication,  and  those  which  involve  global  commu¬ 
nication.  (We  use  tti  ;d  “communication”  here  in  a  general  sense;  in  a  shared  memory  machine, 
this  refers  tc  shared  access  to  memory.)  As  example  preconditioners  we  consider  a  block  diagonal 
matrix  for  the  purely  local  case  and  a  preconditioner  based  on  FFT  solves  along  the  interfaces  for 
the  neighbor  communication  case.  As  an  example  involving  global  communication,  we  consider  the 
Bramble  et  al  preconditioner  [2],  which  requires  the  solution  of  a  linear  system  for  the  cross  points 
(or  vertices).  This  method  involves  only  low  bandwidth  global  communication  (that  is.  the  size  of 
the  messages  scales  with  the  number  of  processors);  we  do  not  consider  any  method  which  uses 
high  bandwidth  communication  (where  message  size  scales  with  the  size  of  the  problem). 

We  develop  a  complexity  model  for  each  type  of  parallel  computer  that  is  based  on  two  major 
contributions:  floating  point  work  and  "shared  memory  access”.  This  latter  term  measures  the 
cost  of  communicating  information  between  processors.  In  a  distributed  memory  system,  this  is 
represented  by  a  communication  time.  In  a  shared  memory  system,  there  are  several  contributions, 
including  cache  size  and  bandwidth,  and  the  nutnbpr  of  simultaneous  memory  requests  which  may 
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Figure  1:  Two  forms  of  decompositions.  A  two  domain  strip 
decomposition  is  shown  in  (a);  a  many  domain  box  decom¬ 
position  with  cross  points  (labeled  UC”)  is  shown  in  (b). 


Figure  2:  The  partitioning  of  the  matrix  ,4S  for  the  5-point 
Laplacian  and  two  strips.  The  dashed  lines  separate  subdo¬ 
mains;  the  solid  lines  separate  the  interior  unknowns  from 
the  interface  unknowns. 

be  served. 

2.  Comments  on  Parallelism  Costs 

In  any  parallel  algorithm,  there  are  a  number  of  different  costs  to  consider.  The  most  obvious 
of  these  are  intrinsically  serial  computations.  For  example,  the  dot  products  in  the  conjugate 
gradient  method  involve  the  reduction  of  values  to  a  single  sum;  this  takes  at  least  logp  time.  More 
subtle  are  costs  from  the  implementation,  both  software  and  hardware.  An  example  of  the  soft  ware 


Figure  3:  The  partitioning  of  the  matrix  Ac  for  the  5-point 
Laplacian  and  four  boxes  arranged  ir.  a  twc  by  two  manner. 

The  dashed  lines  separate  subdomains;  the  solid  lines  sepa¬ 
rate  the  interior  unknowns  from  the  interface  unknowns  and 
the  cross  point  unknown. 

cost  is  the  need  to  guarantee  safe  access  to  shared  data;  this  is  often  handled  with  barriers  or 
more  general  critical  regions.  Sample  hardware  costs  include  bandwidth  limits  in  shared  resources 
such  as  memory  buses  and  startup  and  transfer  speeds  in  communication  links.  Perhaps  the  most 
subtle  cost  iies  in  algorithmic  changes  to  “improve”  parallelism;  by  choosing  a  poor  algorithm 
over  another,  less  parallel  algorithm,  artificially  good  parallel  efficiencies  can  be  found.  We  call  a 
high  parallel  efficiency  “artificial”  if  there  exists  an  algorithm  with  lower  parallel  efficiency  which 
nevertheless  executes  in  less  wall-clock  time  on  a  given  number  of  processors.  Of  the  many  examples 
in  the  literature,  one  of  the  most  dramatic  is  [5]  on  computing  the  forces  in  the  rc-body  potential 
problem;  the  naive  algorithm  is  almost  perfectly  parallel  but  substantially  slower  than  the  linear- 
in-n  algorithm,  which  contains  some  reductions  and  hence  some  intrinsically  serial  computations. 

We  can  identify  the  following  costs  in  the  domain  decomposition  algorithms  that  we  are  con¬ 
sidering: 

•  Dot  (inner)  products.  These  involve  a  reduction,  and  hence  at  least  logp  time:  in  addition, 
there  may  be  some  critical  sections  (depending  on  the  implementation). 

•  Matrix- vector  products.  These  involve  shared  data,  and  hence  may  introduce  some  constraints 
on  shared  hardware  resources. 

•  Preconditioner  solves.  These  depend  on  the  preconditioner  chosen,  and  hence  gives  us  the  most 
freedom  in  trading  off  greater  parallelism  against  superior  algorithmic  performance. 

Note  that  the  sharing  of  data  is  not  random;  most  of  the  data  sharing  occurs  between  neighbors. 


2.1.  Message  passing  and  Shared  memory  models 

Two  methods  for  achieving  parallelism  in  computer  hardware  for  MIMD  machines  are  message 
passing  and  shared  memory.  In  both  of  these,  the  software  and  hardware  costs  discussed  above 
show  up  in  the  cost  to  access  shared  data.  Each  of  these  methods  is  optimized  for  a  different 
domain,  and  these  optimizations  are  reflected  in  the  actual  costs.  In  the  following,  to  simplify  the 
notation  will  will  express  all  times  in  terms  of  the  time  to  do  a  floating  point  operation.  Further, 
we  will  drop  constant  factors  from  our  estimates. 

In  a  message  passing  machine,  each  processor  has  some  local  memory  and  a  set  of  communica¬ 
tion  links  to  some  (usually  not  all)  other  processors  (called  nodes).  Each  processor  has  access  only 
to  its  own  local  memory.  Communication  of  shared  data  is  handled  (usually  by  the  programmer) 
by  explicitly  delivering  data  over  the  communication  links.  This  takes  time  s  +  rn  for  n  words, 
where  s  is  a  startup  time  (latency)  and  r  is  the  time  to  transfer  a  single  word.  This  is  good  for 
local  or  nearest  neighbor  communication.  For  more  global  communication  (such  as  a  dot  product), 
times  depend  on  the  interconnection  network.  For  a  hypercube,  the  global  time  is  (s  +  rn)  log/;: 
for  a  mesh,  it  is  (s  +  rn)^p. 

In  a  shared  memory  machine,  each  processor  may  directly  access  a  shared  global  memory. 
Communication  (access  to)  shared  data  is  handled  by  simply  reading  the  data.  However,  the 
actual  implementation  of  this  introduces  a  number  of  limits.  For  example,  if  the  memory  is  on  a 
common  bus,  then  there  is  a  limit  to  the  number  of  processors  that  can  simultaneously  read  from 
the  shared  memory.  One  way  to  model  this  cost  is  as  1  +  p/  min(p,  P)  [7],  Here  p  is  the  number  of 
processors  and  P  is  the  maximum  number  of  processors  that  may  use  the  resource  at  one  time. 

In  addition,  the  access  tc  the  shared  data  must  be  controlled;  this  can  add  additional  costs 
in  terms  of  barriers  or  critical  sections.  These  can  add  additional  terms  which  are  proportional  to 
logp. 

3.  Complexity  Estimates 

We  can  estimate  the  computational  complexity  for  these  two  models  for  several  forms  of  domain 
decomposition.  We  note  that  these  are  rather  rough  estimates,  good  (because  of  their  generality) 
for  identifying  trends.  Additional  estimates  of  the  complexity  of  parallel  domain  decomposition 
may  be  found  in  [6]. 

3.1.  Message  Passing 

In  this  case  we  can  easily  separate  out  the  computation  terms  and  the  communication  terms. 
For  each  part  of  the  algorithm,  we  will  place  a  computation  term  above  the  related  communication 
term.  In  the  formulas  below,  the  constants  in  front  of  each  term  have  been  dropped  for  clarity. 

In  two  dimensions,  with  n  unknowns  per  side,  we  have  for  strips 
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These  costs  are  all  per  iteration.  It  is  assumed  that  all  neighbor-neighbor  interactions  can  occur 
simultaneously.  In  the  box  case,  the  vertex  coupling  equations  are  solved  by  first  exchanging 
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all  of  the  vertex  data  with  all  the  processors.  Then  each  processor  computes  the  entire  ver'ex 
system.  For  the  small  systems  considered,  this  is  an  efficient  way  to  handle  the  vertex  system.  A 
more  cooperative  parallel  solution  approach  would  incur  more  communication  startups  and  could 
actually  take  longer  [  1 1] .  (A  complexity  comparison  appears  under  #3  below.) 

3.2.  Shared  Memory 

In  this  case,  a  detailed  formula  depends  on  the  specific  design  tradeoffs  made  in  the  hardware. 
The  formula  here  applies  to  bus-oriented  shared  memory  machines;  a  different  formula  would 
be  needed  for  machines  like  the  BBN  Butterfly.  These  formulas  are  dominated  by  bandwidth 
limitations  (the  min(p,  P )  terms)  and  barrier  or  synchronization  costs  (the  logp  terms). 

For  strips,  we  have 


tv  , 
—  I  a 


bp 


min(p,  Pi) 


+  2  (  —  T  log  p  j  +2 


(7-;)(‘  + 


dp 


min(p,  Pi) 


+  n  log  n  +  3  log  p. 


A  similar  formula  holds  for  boxes.  Here,  a,  b ,  c,  d ,  Pi,  and  Pi  are  all  constants  that  depend  on 
the  particular  hardware  and  implementation.  P,  give  a  limit  on  the  number  of  processors  that  can 
effectively  share  a  hardware  resource.  The  ratios  a/b  and  c/d  reflect  the  ratio  of  local  work  to  use 
of  the  shared  resource  (such  as  memory  banks  or  memory  bus). 

3.3.  Implications 

In  domain  decomposition  algorithms,  we  can  trade  iteration  count  against  work  and  parallel 
overhead.  We  will  consider  three  representative  tradeoffs: 

1.  No  communication.  This  amounts  to  diagonal  preconditioning.  Call  the  number  of  iterations 
^decoupled' 

2.  Local  communication  only.  The  FFT-implementable  '‘A'1/2”  preconditionings  such  as  those  in 
[2],  where  we  can  expect  the  iteration  count  to  be  proportional  to  p  for  strips  and  ^Jp  for  boxes. 
Call  the  number  of  iterations  /jocaj. 

The  cost  of  the  “A-1/2”  and  all  more  implicit  forms  of  preconditioning  includes  an  extra  sub- 
domain  solve  to  symmetrize  the  preconditioned  and  is  roughly 


n  n 

n  log  n  -| - log  — |-  2(.s  +  rn) 

V  P 

for  a  strip  decomposition  and  a  message  passing  system.  The  preconditioning  complexity 
estimate  is  dominated  by  the  subdomain  solve  terms,  and  it  is  thus  roughly  twice  that  of 
case  1.  The  communication  costs  are  the  same  as  for  the  matrix- vector  product.  Thus  the 
local  communication  preconditioning  is  effective  if 


“^local  —  ^ 


decoupled ■ 


3.  Global  communication.  In  the  case  of  box-wise  decompositions,  cross  points  occur  at  the 
intersections  of  the  interfaces.  The  cross-points  form  a  global  linear  system  that  is  discussed 
in  [2].  The  iteration  count  is  approximately  independent  of  n  as  p  varies;  call  it  global ‘ 

In  this  case,"  the  additional  costs  over  and  above  the  symmetrization  stage  mentioned  in  #2  are 
those  of  communicating  the  entries  in  the  cross-point  problem  and  its  solving  it.  For  the  case 
of  a  message  passing  system,  we  can  do  this  in  p2  +  (.?  +  r)logp  time  if  each  processor  solves 
the  cross-point  system  and  in  time  p3/2  +  p[s  -(-  r^/p)  time  if  a  parallel  algorithm  is  used  for  the 
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solve,  assuming  a  straightforward  approach  based  on  Banded  Gaussian  elimination  on  a  ring. 
More  sophisticated  approaches  for  hypercubes  which  are  p  +  (s  +  r)  logp  are  known  [3.  1  lj. 

Comparing  this  to  the  cost  of  the  local  computation  of  (n2/p) log(n/v/p)  +  (s  +  n)logp.  the 
floating  point  work  is  negligible  unless  p  is  approximately  equal  to  n,  or  higher.  Assuming  then 
that  we  use  the  local  method,  then  the  additional  cost  of  the  global  communication  makes  the 
full  cross-point  method  better  whenever 

^global -  e)  -  ^local’ 

where  e  is  the  parallel  efficiency,  equal  to  1  minus  the  ratio  of  communication  time  to  compu¬ 
tation  time. 

For  the  shared  memory  case,  if  barriers  and  the  dot  product  reduction  are  the  dominant 
parallelism  costs,  the  result  is  similar. 

4.  Experiments 

The  standard  test  problem  considered  was 

V2u  =  g 

where  g  -  32( x(  1  -  x)  +  y(l  -  y ))  on  the  unit  square.  Though  rather  ideal,  to  minimize  coefficient 
storage  space,  this  problem  has  structural  (from  the  graph-theoretic  viewpoint)  and  spectral  (from 
the  operator-theoretic  viewpoint)  similarities  to  the  generic  self-adjoint  elliptic  problem  and  also 
to  the  non-self-adjoint  problem  in  the  limit  of  sufficiently  high  mesh  refinement.  The  first  set  of 
experiments  was  conducted  on  an  Encore  Multimax  320  shared  memory  parallel  computer  with  18 
processors;  we  used  only  16,  allowing  the  remaining  2  processors  to  handle  various  system  functions. 
The  experiments  on  the  Encore  Multimax  were  done  in  double  precision.  The  results  are  shown 
in  Tables  1  and  2.  The  Encore  is  a  time-sharing  machine,  so  these  timings  are  accurate  and  have 
been  reproduced  only  to  about  10%.  The  computations  were  performed  with  no  other  users  on  the 
machine;  however,  various  system  programs  (mailers,  network  daemons)  used  some  resources.  In 
addition,  even  a  programmer  who  fully  understands  the  dependencies  in  a  given  problem  can  not 
force  each  process  to  run  on  a  different  processor;  operating  system  logic  intervenes. 

The  tables  show  the  iteration  count  /,  an  estimate  oi  the  condition  number  k  (estimated  as  in 
[9]),  the  time  in  seconds  T,  and  the  relative  speedup  a.  The  relative  speedup  is  defined  as  the  ratio 
of  the  time  from  the  previous  column  with  the  time  from  the  current  column.  The  times  do  not 
include  initial  setup.  While  this  slightly  distorts  the  total  time,  it  does  allow  the  time  per  iteration 
to  be  determined  by  dividing  the  time  by  ihe  iteration  count 

The  next  set  of  experiments  was  run  on  a  64  node  Intel  iPSC/2-SX  Ilypercube,  with  1 
Megabytes  of  memory  on  each  node  and  the  SX  floating  point  accelerator  (a  scalar  floating  point 
accelerator).  All  runs  were  in  single  precision  to  allow  a  large  problem  to  fit  in  this  memory  space. 
The  results  are  shown  in  Tables  3-6. 

The  programs  in  both  of  these  cases  were  nearly  the  same.  Only  the  code  dealing  with  shared 
data  was  changed  to  use  either  messages  or  shared  memory.  (The  Encore  implementation  is  not  a 
message  passing  implementation  using  the  shared  memory  to  simulate  message;  it  is  a  “natural" 
shared  memory  code.  In  particular,  the  solution  arrays  are  all  shared.  The  Intel  code  is  a  “natural" 
message  passing  code.  The  solution  arrays  are  all  local  (private).) 

There  are  some  differences  between  the  iteration  count  results  for  the  Encore  and  the  Intel 
implementations.  These  differences  seem  to  be  due  to  the  difference  in  floating  point  arithmetic  on 
the  two  machines.  Double  precision  was  required  on  the  Encore  to  get  the  fast  Poisson  solver  we 
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Table  1: 

Results  for  strips  on  the  Encore  Multimax  320. 
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Table  2:  Results  for  boxes  on  the  Encore  Multimax  320. 
using  full  vertex  coupling. 
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Table  5:  Results  for  boxes  on  the  Intel  Hypercube,  without 
vertex  coupling  but  with  interface  preconditioning  (neighbor 
coupling).  The  "X  A”  entries  could  not  be  computed  on  out 
iPSC/2. 


Two  of  the  full  vertex  coupling  (global)  results  show  superlinear  relative  speedup.  1  hi'  is 
a  real  effect,  which  derives  from  the  superlinear  growth  in  the  cost  of  solving  a  single  domain 
as  a  mnction  of  the  number  of  points  in  the  domain.  This  speedup  is  of  course  available  to  a 
single  processor  domain  decomposition  algorithm.  In  fact,  the  slight  relative  speedups  seen  for  'he 
strip  decompositions  are  due  almost  entirely  to  this  effect  alone,  since  the  number  of  iteration'  is 
proportional  to  the  number  of  processors. 

5.1.  Comparison  with  the  theory 

In  the  case  of  the  message  passing  results  (Intel  Hypercube),  it  is  possible  to  fit  the  t Imon" mal 
complexity  estimate  to  the  measured  times.  Taking  only  the  highest  order  terms  in  the  laimic'.  •> 
and  the  arithmetic  suggests  a  fit  for  the  strip  decomposition  of 

n*  n*  .  n  , 

<i\ - b  a2 —  log  -  +  +  a.,  log  p. 

P  P  P 

We  have  ignored  the  r  terms  because  .s  >  rn  for  given  n  on  the  Intel  hypercube.  A  least  >rj >i 

fit  to  the  data  yields  «,  =  0.00013,  a2  =  0.000090.  a2  =  0.000027.  and  a4  =  0.0077.  . .  fit- 

were  computed  by  scalirig  the  equations  by  the  inverse  of  the  time,  making  the  least  squares  v  m 

Afu  =  (l.l . l)r.  The  residual  ||  ,Uu —  ( 1 . 1  l^llc  is  9.0  1-1.  The  fit  is  shown  in  Future  1.  W  '  7 

these  values  for  directly  from  the  data),  the  efficiency  p<  r  it r ration  ran  be  shown  to  be  around  ' or 
for  the  larger  problems. 

Some  care  is  necessary  in  interpreting  the  graohs  in  Figure  1.  In  particular,  the  mea-up’  ■'! 
efficiency  used  in  the  figure  is  relative  to  the  single  subdomain  case  for  a  single  iteration.  Mecau-e  '  F, ■  • 
cost  of  solving  on  a  subdomain  falls  faster  than  linearly  with  increasing  numbers  of  subdomain', 
this  measure  of  efficiency  will  often  be  greater  than  one.  An  estimate  of  the  parallel  efficiency 
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Figure  4:  Data  for  iPSC/2  runs  with  stri^  decomposition. 
The  fit  is  made  using  Equation  (5.1).  The  y  axis  is  the  ef¬ 
ficiency,  defined  as  the  time  on  a  single  processor,  divided 
by  the  number  of  processors  times  the  time  on  that  many 
processors. 


(computation  time  divided  by  total  time)  can  be  made  from  Equation  (5.1): 

di—  +<z2^  log* 

£  __  _ 1  P  L  V  &  V _ 

aiy  +  a2I~  logs  +  a3  +  a4  log  p 

Figure  5  shows  the  behavior  of  E  with  respect  to  p  for  a  sample  value  of  n. 

For  the  box  decompositions,  the  highest  order  terms  in  the  complexity  model  are 


(5.2) 


(5.3) 


We  have  again  ignored  the  r  terms  because  of  the  high  latency  of  the  Intel  hypercube.  A  least 
squares  fit  to  the  data  yields  aj  =  0.00012,  a2  =  0.000091.  a3  =  0.000014,  n4  =  0.0067.  and 
a.5  =  0.0031.  The  coefficients  were  computed  in  the  same  way  as  for  strips,  and  the  residual  is 
0.032.  A  graph  of  the  data  and  the  fit  is  shown  in  Figure  6.  The  u3  term  comes  from  the  arithmetic 
complexity  of  solving  the  global  cross-point  system  on  a  single  processor  using  an  already  factored 
matrix.  The  efficiency  per  iteration  is  lower,  because  of  the  increased  communication  overhead, 
but  is  still  above  70%  for  the  larger  problems.  This  makes  the  strip  decomposition  superior  for 
moderate  numbers  of  processors  (where  the  iteration  counts  are  similar). 

To  get  a  better  idea  for  the  individual  overheads,  the  program  for  the  Intel  Hypercube  was 
instrumented  to  provide  data  on  the  cost  of  each  communication  operation.  This  data  showed 
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Table  6:  Results  for  boxes  on  the  Intel  Hypercube,  using 
diagonal  blocks  only  (no  coupling.) 


Figure  5:  Estimated  parallel  efficiency  for  the  strips  case  on 
the  iPSC/2,  using  Equation  (5.2),  n  =  255,  and  the  parame¬ 
ters  in  the  text. 


that  the  expense  of  the  global  communication  used  to  form  the  vertex  coupling  system  was  always 
smaller  than  the  communication  to  form  the  dot-products,  and  smaller  in  all  but  one  case  than 
the  nearest-neighbor  communication  used  in  forming  the  matrix-vector  product  Ax.  This  result  is  < 

consistent  with  the  known  communication  behavior  of  the  Intel  Hypercube.  The  two  dot  products 
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Figure  6:  Data  for  iPSC/2  runs  with  full  vertex  coupling. 
The  fit  is  made  using  Equation  (5.3).  The  y  axis  is  the  ef¬ 
ficiency,  defined  as  the  time  on  a  single  processor,  divided 
by  the  number  of  processors  times  the  time  on  that  many 
processors. 


done  in  each  iteration  involve  a  time  2(s  4-  r)logp  while  the  global  exchange  involves  (roughly) 
slogp  +  pr\  since  s  >  r,  the  global  exchanges  take  less  time.  In  the  case  of  the  nearest-neighbor 
communications,  the  comparison  is  with  4(s  +  rn/^/p).  The  larger  amount  of  data  moved  in  ihe 
neighbor  computation  makes  it  often  more  expensive  than  the  global  communication  for  the  vertex 
coupling.  Further,  when  n  is  small,  we  would  expect  the  neighbor  communication  to  be  faster  than 
the  global  communication,  and  this  is  exactly  what  is  observed.  A  different  balance  of  .s  and  r. 
or  significantly  more  processors  would  of  course  change  these  results.  For  example,  on  a  truely 
massively  parallel  machine  with  thousands  of  processors,  the  neighbor  communication  would  be 
smaller  than  either  the  dot-product  or  global  communication. 

This  leads  to  the  results  in  Tables  5  and  6,  where  a  simpler  communication  strategy  has  been 
traded  against  larger  iteration  counts. 

For  the  shared  memory  machine,  the  complexity  estimates  are  harder  to  demonstrate.  In  part, 
this  is  due  to  the  design  of  the  shared  memory  machines;  the  number  of  processors  is  deliberately 
limited  to  roughly  what  the  hardware  (i.e.,  memory  bus)  can  support.  The  dominant  effect  is 
usually  load  balancing  or  the  intrinsically  serial  parts  of  the  computation  (synchronization  points 
and  dot  products). 

The  results  for  the  Intel  Hypercube  are  summarized  in  Table  7  which  gives  the  optimal  choice 
of  domain  decomposition  algorithm  (from  among  those  considered)  for  various  values  of  p  and  h  for 
the  Intel  Hypercube.  As  either  the  number  of  processors  or  the  number  of  mesh  points  increase.  I  he 
global  or  full  vertex  coupling  algorithm  becomes  more  efficient,  despite  its  additional  communicat  ion 
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Table  7:  Optimal  choice  of  algorithm  for  the  given  problem 
and  implementation  on  the  Intel  Hypercube,  from  the  choices 
of  decoupled  block  diagonal,  locally  coupled  interfaces,  and 
full  vertex  coupling  preconditioners  for  the  box  decomposi¬ 
tion. 

demands.  These  results  emphasize  the  importance  of  considering  a  wide  range  of  decompositions 
and  problem  sizes.  For  example,  the  results  for  p  =  4  can  give  a  misleading  picture  of  the  usefulness 
of  both  the  fully  decoupled  and  the  locally  coupled  interface  preconditioners.  In  fact,  the  “local’' 
preconditioning  wins  by  an  extremely  narrow  margin  in  the  two  cases  in  which  it  beats  out  “global". 
Our  results  show  that,  even  for  relatively  small  problems,  the  asymptotically  superior  performance 
(in  iteration  count)  of  the  full  vertex  coupling  preconditioners  more  than  compensates  for  the  the 
additional  cost  of  the  communication.  This  is  true  even  for  a  machine  such  as  the  Intel  iPSC/2 
Hypercube,  with  relatively  slow  communication  and  global  communication  that  is  proportional  to 
logp.  The  older  iPSC/1  Hypercube  had  an  identical  table,  which  is  not  surprising,  given  that  the 
ratio  of  communication  speed  to  computation  speed  for  the  two  machines  is  similar. 
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